src/AllCells/[RC17]Normalization.Rmd · MonolayerCultures

---
title: '[RC17] Normalization'
author: "Nina-Lydia Kazakou"
date: "03/03/2022"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Set-up
### Load libraries
```{r message=FALSE, warning=FALSE}
library(SingleCellExperiment)
library(Seurat)
library(tidyverse)
library(Matrix)
library(scales)
library(cowplot)
library(RCurl)
library(scran)
library(Matrix)
library(ggplot2)
library(edgeR)
library(dplyr)
library(ggsci)  
library(here)
```

### Colour Palette 
```{r load_palette}
mypal <- pal_npg("nrc", alpha = 0.7)(10)
mypal2 <-pal_tron("legacy", alpha = 0.7)(7)
mypal3<-pal_lancet("lanonc", alpha = 0.7)(9)
mypal4<-pal_simpsons(palette = c("springfield"), alpha = 0.7)(16)
mypal5 <- pal_rickandmorty(palette = c("schwifty"), alpha = 0.7)(6)
mypal6 <- pal_futurama(palette = c("planetexpress"), alpha = 0.7)(5)
mypal7 <- pal_startrek(palette = c("uniform"), alpha = 0.7)(5)
mycoloursP<- c(mypal, mypal2, mypal3, mypal4)
```

### Load sce or take from environment
```{r}
filter_sce <- readRDS(here("data", "Processed", "AllCells", "RC17_scaterQC_sce.rds"))
```


# Normaliztion (with scran package)

*[METHOD 1] Library size normalization:* calculate size factors just taking in consideration the library size of each individual cell
Library size: the total sum of counts across all genes for each cell, the expected value of which is assumed to scale with any cell-specific biases.
The library size factor for each cell is then directly proportional to its library size where the proportionality constant is defined such that the mean size factor across all cells is equal to 1.

```{r}
# lib.sf <- librarySizeFactors(filter_sce)
# summary(lib.sf)
```

```{r}
# hist(log10(lib.sf), xlab="Log10[Size factor]", col='grey80')
```

The use of library size factors assumes that there is no imbalance in the differentially expressed (DE) genes between any pair of cells.
This means that any upregulation for a subset of genes is cancelled out by the same magnitude of downregulation in a different subset of genes.
However, balanced DE is not generally present in scRNA-seq applications, which means that library size normalization may not yield accurate normalized expression values for downstream analyses!


So, here I will use:

*METHOD 2: Normalization by deconvolution:* use pools of cells to calculate the size factors 
The approximation relies on stochastic initialization so we need to set the random seed for reproducibility.

```{r}
set.seed(100)
```

Here, the cells in each cluster are normalized separately and the size factors are rescaled to be comparable across clusters.
This avoids the assumption that most genes are non-DE across the entire population - only a non-DE majority is required between pairs of clusters, which is a weaker assumption for highly heterogeneous populations.

```{r}
# Create the pools of cells with quick cluster and save them in clust_filter_sce
clust_filter_sce <- quickCluster(filter_sce) 
table(clust_filter_sce)
```

I can also compute the sizefactors & save them on a separate object (optional).
Use of the deconvolution size factors adjusts for these biases to improve normalization accuracy for downstream applications.
```{r}
# Compute & save sizefactos
#deconv.sf <- calculateSumFactors(filter_sce, cluster=clust_filter_sce) 
#plot(lib.sf, deconv.sf, xlab="Library size factor",
#     ylab="Deconvolution size factor", log='xy', pch=16,
#     col=as.integer(factor(filter_sce$Sample)))
#abline(a=0, b=1, col="red")
```

I will compute the normalized expression values for each cell by dividing the count for each gene transcript with the appropriate size factor for that cell and add it directly to the object meta.data:
```{r}
norm_sce <- filter_sce
norm_sce <- computeSumFactors(norm_sce, cluster = clust_filter_sce, min.mean = 0.1)
deconv.sf <- sizeFactors(norm_sce) #visualize the size factors
summary(deconv.sf) # Check that there are no 0 or negative size factors; if it does change the seed and recompute them
```


Now, I will apply the normalization through the size factors.
```{r}
norm_sce<- logNormCounts(norm_sce)
assayNames(norm_sce)
head(sizeFactors(norm_sce))
```

```{r}
saveRDS(norm_sce, here("data", "Processed", "AllCells", "RC17_norm_sce.rds"))
```

## Quantifying per gene variation
Simply compute the variance of the log-normalised expression values (i.e. logcounts) for each gene across all cells in the population.
Genes with the largest variances in log-values will contribute the most to the Euclidean distances between cells.
By using log-values here, we ensure that our quantitative definition of heterogeneity is consistent throughout the entire analysis.

The log-transformation does not achieve perfect variance stabilization, which means that the variance of a gene is driven more by its abundance than its underlying biological heterogeneity.
Fit a trend to the variance with respect to abundance across all genes:
```{r}
dec_norm_sce <- modelGeneVar(norm_sce)
```

```{r}
fit.norm_sce <- metadata(dec_norm_sce)
plot(fit.norm_sce$mean, fit.norm_sce$var, xlab="Mean of log-expression",
     ylab="Variance of log-expression")
curve(fit.norm_sce$trend(x), col="dodgerblue", add=TRUE, lwd=2)
```

Genes with negative biological components (bio) have no obvious interpretation and can be mostly ignored.
```{r}
dec_norm_sce[order(dec_norm_sce$bio, decreasing=TRUE),]

write.csv(dec_norm_sce, here("outs", "Processed", "AllCells", "GeneVariation_after_ScranNormalization.csv"))
```


## Quantifying technical noise
Attempt to create a trend by making some distributional assumptions about the noise.

UMI counts typically exhibit near-Poisson variation if we only consider technical noise from library preparation and sequencing. This can be used to construct a mean-variance trend in the log-counts:
```{r}
set.seed(0010101)
dec_poi_norm_sce <- modelGeneVarByPoisson(norm_sce)
dec_poi_norm_sce <- dec_poi_norm_sce[order(dec_poi_norm_sce$bio, decreasing=TRUE),]
head(dec_poi_norm_sce)
```

```{r}
plot(dec_poi_norm_sce$mean, dec_poi_norm_sce$total, pch=16, xlab="Mean of log-expression",
     ylab="Variance of log-expression")
curve(metadata(dec_poi_norm_sce)$trend(x), col="dodgerblue", add=TRUE)
```

However, more accurate noise model do not necessarily yield a better ranking of HVGs.

```{r}
sessionInfo()
```